Wine Quality EDA by Chip Reeves

Univariate Plots Section

##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Univariate Analysis

What is the structure of your dataset?

The dataset has 1599 observations. Each has a row index (X), 11 input attributes, and 1 output attribute (quality).

What is/are the main feature(s) of interest in your dataset?

Wine quality is the output of interest. It is scored on a 0-10 scale, with 10 being the best. In this dataset the scores range from 3 to 8.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

The dataset also captures 11 physicochemical inputs that could be used to predict wine quality.

Did you create any new variables from existing variables in the dataset?

Yes. I applied a log10 transformation to the long-tailed distributions of residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, and sulphates, and stored the results as new variables.
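A minimal sketch of that transformation follows. The data frame name `rw` and the column `log.sulphates` appear in the model output later in this report; the other `log.` column names are my assumption about the naming pattern. The placeholder values are the quartiles and maxima from the summary above, not real rows.

```r
# Placeholder stand-in for the red wine data frame (`rw`).
# Values are the 1st quartile, median, 3rd quartile, and max from the summary.
rw <- data.frame(
  residual.sugar       = c(1.9, 2.2, 2.6, 15.5),
  chlorides            = c(0.070, 0.079, 0.090, 0.611),
  free.sulfur.dioxide  = c(7, 14, 21, 72),
  total.sulfur.dioxide = c(22, 38, 62, 289),
  sulphates            = c(0.55, 0.62, 0.73, 2.00)
)

# Variables with long right tails.
skewed <- c("residual.sugar", "chlorides", "free.sulfur.dioxide",
            "total.sulfur.dioxide", "sulphates")

# Add one log10 column per skewed variable, e.g. log.sulphates.
for (v in skewed) {
  rw[[paste0("log.", v)]] <- log10(rw[[v]])
}

names(rw)
```

All five variables are strictly positive in the summary, so log10 is defined for every row. A variable with zeros, such as citric acid, would need a different treatment.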

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

The distributions listed above were long-tailed, which is why I log-transformed them. Otherwise the data is clean and well structured, and it required no additional wrangling beyond those optional transformations.
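The `stat_bin()` message in the plots section shows that the histograms used the ggplot2 default of 30 bins. One way to pick an explicit `binwidth` is the Freedman-Diaconis rule, sketched below. The helper `fd_binwidth` and the synthetic `toy` data are hypothetical and only illustrate the idea; this is not the code behind the plots in this report.

```r
# Freedman-Diaconis rule: binwidth = 2 * IQR / n^(1/3).
fd_binwidth <- function(x) {
  2 * IQR(x) / length(x)^(1 / 3)
}

# Synthetic, right-skewed stand-in for a column such as alcohol.
set.seed(42)
toy <- data.frame(alcohol = 8.4 + rexp(500, rate = 0.5))

bw <- fd_binwidth(toy$alcohol)
bw

# Passing the width explicitly means stat_bin() no longer falls back to bins = 30.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(toy, ggplot2::aes(x = alcohol)) +
    ggplot2::geom_histogram(binwidth = bw)
  print(p)
}
```

For an integer-valued variable such as quality, a fixed `binwidth = 1` is probably the more natural choice.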

Bivariate Plots Section

Bivariate Analysis

Note: I originally used the ggpairs function for this section. However, the knitted HTML file presents the equivalent plot from the psych package because it is formatted better.

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

  • quality vs fixed acidity - no relation or small positive relation
  • quality vs volatile acidity - strong negative relation (higher quality has lower volatile acidity)
  • quality vs citric acid - strong positive relation (higher quality has higher citric acid content)
  • quality vs density - no relation or slight negative relation
  • quality vs pH - no relation or slight negative relation
  • quality vs alcohol - strong positive relation
  • quality vs residual sugar (log) - no relation
  • quality vs sulphates (log) - moderate positive relation
  • quality vs chlorides (log) - no relation or slight negative relation
  • quality vs free sulfur dioxide (log) - no relation
  • quality vs total sulfur dioxide (log) - no relation

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Stronger correlations (beyond +/- .5). Items marked * are to be expected.

  • Citric acid vs fixed acidity*
  • Citric acid vs volatile acidity*
  • Density vs fixed acidity
  • pH vs fixed acidity*
  • pH vs citric acid*
  • Total sulfur dioxide vs free sulfur dioxide*

Moderate correlations (beyond +/- .25). Items marked * are to be expected.

  • Fixed acidity vs volatile acidity*
  • Density vs citric acid
  • Alcohol vs density
  • Residual sugar vs density
  • Chlorides vs density
  • Chlorides vs pH
  • Chlorides vs alcohol
  • Sulphates vs volatile acidity
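The pairs listed above were read off the correlation matrix by eye. A short sketch of how such pairs could be pulled out programmatically is below. The helper `strong_pairs` and the synthetic columns `a`, `b`, and `c` are hypothetical; with the real data the function would be called on the numeric columns of the wine data frame.

```r
# Return variable pairs whose absolute Pearson correlation exceeds a cutoff.
strong_pairs <- function(df, cutoff = 0.5) {
  r <- cor(df)
  # Look only above the diagonal so each pair is reported once.
  idx <- which(upper.tri(r) & abs(r) > cutoff, arr.ind = TRUE)
  data.frame(
    var1 = rownames(r)[idx[, "row"]],
    var2 = colnames(r)[idx[, "col"]],
    r    = r[idx],
    stringsAsFactors = FALSE
  )
}

# Synthetic data: b tracks a closely, c is unrelated noise.
set.seed(1)
a <- rnorm(200)
toy <- data.frame(a = a, b = a + rnorm(200, sd = 0.3), c = rnorm(200))

strong_pairs(toy, cutoff = 0.5)
```

Calling the same function with `cutoff = 0.25` and dropping the pairs already found at .5 would give the moderate group.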

The supporting text gives several hints as to some interesting graphs.

  • Pleasant wines have lower levels of acetic acid (volatile acidity).
  • Fresh, fruity wines (pleasant) have higher levels of citric acid.
  • Sulfur dioxide helps up to a certain level, but higher levels are unpleasant.
    • All of these are supported by the patterns in the quality boxplots.
  • Density is related to alcohol, sugar, and salt content.
    • This is supported by the moderate values in the correlation matrix.

What was the strongest relationship you found?

The highest correlation was between free sulfur dioxide and total sulfur dioxide. After taking the log10 of each, the correlation was .785. However, this is to be expected as both are measures of sulfur dioxide.
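The reason this correlation is expected can be shown with synthetic numbers: total sulfur dioxide is the sum of the free and bound forms, so it contains the free measurement as a component. The sketch below uses made-up lognormal values (the names `free`, `bound`, and `total` are illustrative), so its output will not match the .785 reported above.

```r
# Synthetic illustration of a part-whole correlation.
set.seed(10)
free  <- rlnorm(1000, meanlog = 2.5, sdlog = 0.6)  # made-up free SO2 values
bound <- rlnorm(1000, meanlog = 3.0, sdlog = 0.6)  # made-up bound SO2 values
total <- free + bound                              # total includes free

# Correlation on the log10 scale, as in the analysis above.
r_log <- cor(log10(free), log10(total))
r_log
```

Even though `free` and `bound` are generated independently here, the two log-transformed measures are positively correlated simply because one is part of the other.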

The variables that appear most useful for determining wine quality are volatile acidity, citric acid, pH, and alcohol.

Multivariate Plots Section

## Warning: Removed 1 rows containing missing values (geom_point).

## Warning: Removed 2 rows containing missing values (geom_point).

## 
## Calls:
## m1: lm(formula = quality ~ volatile.acidity, data = rw)
## m2: lm(formula = quality ~ volatile.acidity + citric.acid, data = rw)
## m3: lm(formula = quality ~ volatile.acidity + citric.acid + alcohol, 
##     data = rw)
## m4: lm(formula = quality ~ volatile.acidity + citric.acid + alcohol + 
##     log.sulphates, data = rw)
## 
## ================================================================
##                        m1         m2         m3         m4      
## ----------------------------------------------------------------
##   (Intercept)        6.566***   6.529***   3.055***   3.444***  
##                     (0.058)    (0.089)    (0.194)    (0.196)    
##   volatile.acidity  -1.761***  -1.723***  -1.343***  -1.217***  
##                     (0.104)    (0.125)    (0.114)    (0.112)    
##   citric.acid                   0.063      0.068     -0.113     
##                                (0.115)    (0.103)    (0.103)    
##   alcohol                                  0.314***   0.303***  
##                                           (0.016)    (0.016)    
##   log.sulphates                                       1.518***  
##                                                      (0.181)    
## ----------------------------------------------------------------
##   R-squared             0.153      0.153      0.317      0.346  
##   adj. R-squared        0.152      0.152      0.316      0.344  
##   sigma                 0.744      0.744      0.668      0.654  
##   F                   287.444    143.812    246.976    210.808  
##   p                     0.000      0.000      0.000      0.000  
##   Log-likelihood    -1794.312  -1794.160  -1621.596  -1587.153  
##   Deviance            883.198    883.030    711.603    681.597  
##   AIC                3594.624   3596.320   3253.192   3186.306  
##   BIC                3610.756   3617.828   3280.078   3218.569  
##   N                  1599       1599       1599       1599      
## ================================================================

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

I initially produced single plots comparing alcohol, citric acid, and volatile acidity, all colored by quality. Because of the large number of “average” quality wines, it was difficult to pick out visual patterns. I decided to focus on the characteristics of higher quality wines (7-8) compared with lower quality wines (3-4). I subset the data accordingly and produced pairs of graphs with similar scales.

The lower quality wines in the sample have lower alcohol content and higher volatile acidity than the higher quality wines. However, the alcohol content of the higher quality wines appears to be more evenly dispersed. The lower quality wines also combine a lower citric acid content with their lower alcohol content and higher volatile acidity.
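A sketch of the split described above, on a toy data frame. The name `rw` matches the model output in this report, but the rows are made up, and `low`, `high`, and `alcohol_limits` are names I chose for illustration.

```r
# Toy ratings; the rows are made up for illustration.
rw <- data.frame(
  quality = c(3, 4, 5, 5, 6, 6, 7, 8),
  alcohol = c(9.0, 9.4, 9.8, 10.0, 10.5, 11.0, 11.8, 12.5)
)

# Keep only the tails of the quality distribution.
low  <- subset(rw, quality %in% 3:4)
high <- subset(rw, quality %in% 7:8)

# Shared axis limits so the paired plots are directly comparable.
alcohol_limits <- range(c(low$alcohol, high$alcohol))

nrow(low)
nrow(high)
alcohol_limits
```

Computing the limits from both subsets together is one way to keep the pair of graphs on the same scale.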

Were there any interesting or surprising interactions between features?

It was interesting to see the lack of interaction between alcohol and citric acid on the good wines. Good wines exist across most points in this plot.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

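I fit a linear model to predict wine quality. The model using volatile acidity, citric acid, alcohol, and sulphates (log10) had an R-squared of only .346, so it has very little predictive power. Adding citric acid left R-squared unchanged at .153 and its coefficient was not significant, while adding alcohol roughly doubled R-squared to .317.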
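The sequence m1 to m4 in the table above can be built by adding one predictor at a time. The sketch below does this on synthetic data, so its numbers will differ from the table. The predictor ranges are taken from the summary, but the coefficients used to generate `quality` are arbitrary assumptions, and `fit_stats` is a name chosen for illustration.

```r
# Synthetic stand-in for `rw`; the generating coefficients are arbitrary.
set.seed(7)
n <- 300
rw <- data.frame(
  volatile.acidity = runif(n, 0.12, 1.58),
  citric.acid      = runif(n, 0, 1),
  alcohol          = runif(n, 8.4, 14.9),
  log.sulphates    = log10(runif(n, 0.33, 2))
)
rw$quality <- 4 - 1.2 * rw$volatile.acidity + 0.3 * rw$alcohol +
  rnorm(n, sd = 0.7)

# Grow the model one predictor at a time, mirroring m1 to m4.
m1 <- lm(quality ~ volatile.acidity, data = rw)
m2 <- update(m1, . ~ . + citric.acid)
m3 <- update(m2, . ~ . + alcohol)
m4 <- update(m3, . ~ . + log.sulphates)

# R-squared cannot decrease as terms are added;
# adjusted R-squared and AIC penalize the extra terms.
models <- list(m1 = m1, m2 = m2, m3 = m3, m4 = m4)
fit_stats <- data.frame(
  r.squared     = sapply(models, function(m) summary(m)$r.squared),
  adj.r.squared = sapply(models, function(m) summary(m)$adj.r.squared),
  AIC           = sapply(models, AIC)
)
fit_stats
```

Comparing adjusted R-squared or AIC across the rows shows whether a new term earns its place, which is how the citric acid term can be judged in the table above.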


Final Plots and Summary

Plot One

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Description One

The vast majority of wines are average quality. It may be difficult to draw conclusions from such a narrow data set.
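The share of average wines can be checked with a frequency table. The sketch below uses a made-up vector of ratings and treats scores of 5 and 6 as “average”, which is my reading of the 3-4 and 7-8 groupings used elsewhere in this report.

```r
# Toy ratings, not the real data.
quality <- c(3, 4, 5, 5, 5, 6, 6, 6, 7, 8)

counts <- table(quality)
shares <- prop.table(counts)

# Share of wines rated 5 or 6.
avg_share <- sum(shares[c("5", "6")])
avg_share
```

With the real data, `quality` would be the quality column of the wine data frame.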

Plot Two

Description Two

This plot shows a clear inverse relationship between volatile acidity and quality. Both the median and the spread of volatile acidity decrease as quality increases.
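The pattern in the boxplot can also be summarized numerically, with one median and one interquartile range per quality level. The sketch below uses a small made-up data frame built to show a decreasing pattern; it does not reproduce the values in the plot.

```r
# Made-up values for three quality levels.
toy <- data.frame(
  quality          = rep(c(4, 6, 8), each = 4),
  volatile.acidity = c(0.50, 0.70, 0.90, 1.10,
                       0.40, 0.50, 0.60, 0.70,
                       0.30, 0.35, 0.40, 0.45)
)

# Median and interquartile range of volatile acidity per quality level.
med <- tapply(toy$volatile.acidity, toy$quality, median)
iqr <- tapply(toy$volatile.acidity, toy$quality, IQR)

by_quality <- data.frame(
  quality = as.numeric(names(med)),
  median  = as.vector(med),
  iqr     = as.vector(iqr)
)
by_quality
```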

Plot Three

## Warning: Removed 1 rows containing missing values (geom_point).

Description Three

The lower quality wines have few redeeming properties. They are lower in citric acid, so they do not taste as fresh and fruity as the higher quality wines. They also have a lower alcohol content, so it would take more of this unpleasant beverage to achieve the desired level of “relaxation”.

For the higher quality wines, the plot of citric acid and alcohol is more evenly distributed. From a statistical point of view, this indicates little relationship between these variables at these quality levels. Perhaps other variables have stronger relationships with higher quality wines. However, from a practical standpoint, it could also mean that there are pleasant wines across many points in these scales. Personal preference may impact quality ratings.

Reflection

The wine quality data set looked at the chemical properties of 1599 red wines. I produced univariate, bivariate, and multivariate plots to attempt to find the variables with the highest impact on wine quality.

One area of frustration was that nothing really “jumped out”. Many of the plots did not indicate any relationship with wine quality. It was hard to feel confident in choosing any direction for further analysis. It was also difficult to draw many conclusions from this data set since over 80% of the wines are “average” quality. Any fitted models would likely have limited predictive power. It was also challenging to model what is really a discrete or categorical variable.

During the multivariate analysis, I felt most confident in the analysis that split the data into higher and lower quality wines. That unconventional decision finally seemed to produce something interesting. The obvious drawback to this approach is that it discards most of the data. Splitting the data this way could be useful for future analysis.

As for future analysis, modeling quality as a categorical dependent variable seems to be a better approach. Additional research would be needed to determine how to implement this in R. This would also be a good opportunity to consider additional variables in the model to attempt to improve predictive power.
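One possible starting point, which I have not applied to this data, is ordinal logistic regression with `polr` from the MASS package. The sketch below fits it to synthetic data only. The three-level grouping, the generating coefficients, and the `toy` data frame are all assumptions made for illustration.

```r
library(MASS)  # provides polr(); ships with standard R distributions

# Synthetic data; predictor ranges follow the summary, coefficients are arbitrary.
set.seed(123)
n <- 400
toy <- data.frame(
  alcohol          = runif(n, 8.4, 14.9),
  volatile.acidity = runif(n, 0.12, 1.58)
)

# Latent score plus logistic noise, cut into three ordered classes.
latent <- 0.8 * toy$alcohol - 2 * toy$volatile.acidity + rlogis(n)
toy$quality <- cut(latent,
                   breaks = quantile(latent, c(0, 0.2, 0.8, 1)),
                   labels = c("low", "average", "high"),
                   include.lowest = TRUE, ordered_result = TRUE)

# Proportional-odds model; Hess = TRUE keeps the Hessian for standard errors.
fit <- polr(quality ~ alcohol + volatile.acidity, data = toy, Hess = TRUE)
coef(fit)

# Predicted class for a hypothetical wine.
predict(fit, newdata = data.frame(alcohol = 13, volatile.acidity = 0.3))
```

With the real data, quality would first need to be converted to an ordered factor, either on the original scores or on grouped levels like the ones used here.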